DAT405/DIT406 Introduction to Data Science and AI

Assignment 1: Introduction to Data Science and Python

Khushi Chitra Uday

Exchange student from CY Tech - France to GU CSE Department

7 hours spent

Anh Thu DOAN

Exchange student from CY Tech - France to GU CSE Department

8 hours spent

Group 26

Q1) GDP per capita and Life Expectancy

From the plot above you can see the amount of missing data per column in the dataset. The columns '145446-annotations' and 'continent' have the most missing values.

1.a Write a Python program that draws a scatter plot of GDP per capita vs life expectancy. State any assumptions and motivate decisions that you make when selecting data to be plotted, and in combining data.

  1. We chose 2018 because it is the most recent year with data in the 'GDP per capita' column.
  2. Because we believe that population affects a country's GDP, we decided to include population as one of the variables in our analysis.

1.b Consider whether the results obtained seem reasonable and discuss what might be the explanation for the results you obtained.

The scatter plot shows that higher life expectancy is accompanied by higher GDP per capita. We also discussed that the population growth rate is likely an important factor contributing to GDP, and that including it could make our interpretation more precise.

1.c Did you do any data cleaning (e.g., by removing entries that you think are not useful) for the task of drawing scatter plot(s) and the task of answering the questions d, e, f, and g? If so, explain what kind of entries that you chose to remove and why.

Yes, we did some data cleaning. In the beginning we applied msno.matrix to the dataframe to plot the missing values in every column. The plot showed that the columns '145446-annotations' and 'continent' had the most missing values. Since we did not need those columns for the other questions, we created another dataframe (dfa) containing only the columns we needed to work with. We also restricted the data to 2018, the most recent year with data in the 'GDP per capita' column.
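The cleaning steps described above can be sketched without any dependencies. This is only an illustration, not the notebook's actual code: the rows are made-up toy values, the column names 'Entity', 'Year' and 'Life expectancy' are assumptions, and the list `cleaned` plays the role of the dataframe dfa.

```python
# Toy rows standing in for the raw table. All values are invented for
# illustration; the column names other than 'GDP per capita' and
# 'continent' are assumptions.
rows = [
    {"Entity": "A", "Year": 2018, "Life expectancy": 80.0, "GDP per capita": 40000.0, "continent": None},
    {"Entity": "B", "Year": 2018, "Life expectancy": 65.0, "GDP per capita": None, "continent": None},
    {"Entity": "C", "Year": 2017, "Life expectancy": 70.0, "GDP per capita": 9000.0, "continent": "Asia"},
    {"Entity": "D", "Year": 2018, "Life expectancy": 72.5, "GDP per capita": 12000.0, "continent": None},
]


def missing_per_column(rows):
    """Count missing (None) values in each column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def clean(rows, columns, year):
    """Keep only `columns`, only rows for `year`, and drop rows with gaps."""
    kept = []
    for row in rows:
        if row["Year"] != year:
            continue
        subset = {column: row[column] for column in columns}
        if all(value is not None for value in subset.values()):
            kept.append(subset)
    return kept


needed = ["Entity", "Year", "Life expectancy", "GDP per capita"]
print(missing_per_column(rows))
cleaned = clean(rows, needed, 2018)
print([row["Entity"] for row in cleaned])
```

Dropping the sparse column before filtering matters: if 'continent' were kept, almost every row would be discarded for having a gap in a column that the later questions never use.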

1.d Which countries have a life expectancy higher than one standard deviation above the mean?
1.e Which countries have high life expectancy but have low GDP?
1.f Does every strong economy (normally indicated by GDP) have high life expectancy?

We assume that a strong economy is a country whose GDP per capita is more than one standard deviation above the mean.

We compared the life expectancy of the countries with the highest and the second-to-last GDP per capita in this group. Qatar had a GDP per capita of 153764.1643 USD in 2018 but a life expectancy of only 80 years, while Belgium had a GDP per capita of 39756.2031 USD but a higher life expectancy than Qatar (81.468 years).

We can therefore say that not every strong economy has a high life expectancy.
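The "one standard deviation above the mean" rule used for 1.d and 1.f can be sketched as follows. This is a minimal illustration under stated assumptions: the countries A to E and their values are invented, and the sample standard deviation is used (which is also the default of pandas' `std`).

```python
from statistics import mean, stdev


def above_one_std(values_by_country):
    """Return the countries whose value exceeds mean + 1 sample standard deviation."""
    values = list(values_by_country.values())
    threshold = mean(values) + stdev(values)
    return sorted(c for c, v in values_by_country.items() if v > threshold)


# Invented toy values, not taken from the dataset.
life_expectancy = {"A": 60.0, "B": 65.0, "C": 70.0, "D": 72.0, "E": 84.0}
gdp_per_capita = {"A": 1000.0, "B": 5000.0, "C": 12000.0, "D": 90000.0, "E": 40000.0}

high_life = above_one_std(life_expectancy)     # question 1.d
strong_economies = above_one_std(gdp_per_capita)  # assumption used for 1.f

# Strong economies that do not also have a high life expectancy.
strong_without_high_life = [c for c in strong_economies if c not in high_life]

print(high_life, strong_economies, strong_without_high_life)
```

If the last list is non-empty, at least one strong economy falls short of the high-life-expectancy threshold, which is the pattern the Qatar and Belgium comparison illustrates.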

The correlation matrix shows a positive linear correlation between life expectancy and GDP per capita. However, as e and f show, some countries have low GDP but high life expectancy, and some countries with the highest GDP do not have the highest life expectancy. We therefore conclude that life expectancy does not depend on GDP per capita alone.
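Each entry of a correlation matrix is a Pearson correlation coefficient between two columns. The sketch below computes one such coefficient by hand. The function name `pearson` and the sample values are assumptions made for illustration and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented toy values: GDP per capita (USD) and life expectancy (years).
gdp = [1000.0, 5000.0, 12000.0, 40000.0, 90000.0]
life = [60.0, 65.0, 70.0, 84.0, 72.0]
print(round(pearson(gdp, life), 3))
```

A coefficient close to 1 only says that the points lie near an upward-sloping line on average. Individual countries can still deviate from that line, which is why a positive correlation and the exceptions found in e and f do not contradict each other.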

Q2) Happiness and life satisfaction, trust, corruption.

Link of the dataset: https://ourworldindata.org/happiness-and-life-satisfaction

Data set description: The Human Development Index (HDI) is an index that measures key dimensions of human development. The three key dimensions are:

– A long and healthy life – measured by life expectancy.

– Access to education – measured by expected years of schooling of children at school-entry age and mean years of schooling of the adult population.

– And a decent standard of living – measured by Gross National Income per capita adjusted for the price level of the country.

This entry provides a basic overview of the Human Development Index over the last decades using the standard HDI methodology of the UNDP.

In addition we are looking at long-term development by relying on the Historical Index of Human Development (HIHD), developed by historian Leandro Prados de la Escosura.

The metrics of the HDI and HIHD are similar, but differ slightly in how they are used to derive the development index – details on these measures can be found in the Data Quality & Definitions section below.

Link of the data set: https://ourworldindata.org/human-development-index

Average rating of perceived corruption in public sector, 2013

CORRUPTION PERCEPTION RATING

Variable description: Average of all individuals' perception ratings on a scale from 1 (corruption is not a problem) to 5 (corruption is a very serious problem).

Variable time span: 2013 – 2013

Data published by: Transparency International - Global Corruption Barometer

Data publisher's source: Population surveys

Link of the data set: https://ourworldindata.org/corruption

2.a Think of several meaningful questions that can be answered with these data, make several informative visualisations to answer those questions. State any assumptions and motivate decisions that you make when selecting data to be plotted, and in combining data.
Questions that can be answered with these data:

1. What is the correlation between corruption and economic development?

2. How satisfied are people with their lives in different continents? How does life satisfaction affect life expectancy?

3. What is the relationship between the HDI and life expectancy?

4. What is the relationship between Gross Domestic Product (GDP) and the Human Development Index (HDI)?

Question 1: What is the correlation between corruption and economic development?

The graph above shows high scores in Eastern Europe and Russia, and in Latin America. The country with the highest corruption perception rating is Mongolia, and the country with the lowest is Rwanda.

There is a clear curve in the plot of corruption against GDP per capita. Low- to middle-income countries (GDP per capita below 30k USD per year) tend to have a high corruption perception rating, while three of the five highest-GDP countries are European (Norway, Switzerland, Luxembourg) and have a corruption rating below 3.5. We can assume that, for most low- to middle-income countries, the corruption rating is related to the country's GDP.
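One way to check the income-group observation numerically is to split the countries at the 30k USD cutoff mentioned above and compare the mean corruption rating on each side. The sketch below does this on invented records. The field names `gdp_per_capita` and `rating` and the country labels are assumptions.

```python
from statistics import mean

INCOME_CUTOFF_USD = 30_000  # cutoff for low- to middle-income countries used in the text


def mean_rating_by_income(records, cutoff=INCOME_CUTOFF_USD):
    """Mean corruption perception rating below and at-or-above the GDP cutoff."""
    below = [r["rating"] for r in records if r["gdp_per_capita"] < cutoff]
    above = [r["rating"] for r in records if r["gdp_per_capita"] >= cutoff]
    return {
        "below_cutoff": mean(below) if below else None,
        "at_or_above_cutoff": mean(above) if above else None,
    }


# Invented toy records on the 1 (no problem) to 5 (very serious problem) scale.
records = [
    {"country": "P", "gdp_per_capita": 5000.0, "rating": 4.0},
    {"country": "Q", "gdp_per_capita": 12000.0, "rating": 4.5},
    {"country": "R", "gdp_per_capita": 45000.0, "rating": 3.0},
    {"country": "S", "gdp_per_capita": 80000.0, "rating": 2.0},
]
print(mean_rating_by_income(records))
```

A clearly higher mean below the cutoff would support the reading of the plot. It would not show which way the influence runs between corruption and income.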

Question 2: How satisfied are people with their lives in different continents? How does life satisfaction affect life expectancy?

As the figure above shows, Asia had on average higher levels of satisfaction throughout 2015, while Oceania and North America had much lower levels of satisfaction than Asia.

The graph illustrates the correlation between life satisfaction and life expectancy on different continents. Most countries in Africa and Asia had both lower life expectancy and lower life satisfaction in 2015. The countries of Europe and South America, on the other hand, had higher life expectancy and higher life satisfaction. In conclusion, there was a strong relationship between life expectancy and life satisfaction.
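The per-continent comparison amounts to grouping countries by continent and averaging each variable within the group. A minimal sketch, assuming hypothetical field names and invented values (the group labels X and Y deliberately do not stand for real continents):

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(records, group_key, value_key):
    """Average `value_key` over all records that share the same `group_key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Invented toy records; satisfaction on a 0-10 scale, expectancy in years.
records = [
    {"continent": "X", "satisfaction": 5.0, "expectancy": 62.0},
    {"continent": "X", "satisfaction": 6.0, "expectancy": 66.0},
    {"continent": "Y", "satisfaction": 7.0, "expectancy": 80.0},
]
print(mean_by_group(records, "continent", "satisfaction"))
print(mean_by_group(records, "continent", "expectancy"))
```

Comparing the two resulting dictionaries side by side gives a quick numerical check of whether continents with higher average satisfaction also have higher average life expectancy.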

Question 3: What is the relationship between the HDI and life expectancy?

It can be concluded from the graph above that, although Norway has the highest HDI score, its life expectancy of 82 is lower than that of Japan, which has a lower HDI score. Because the sample we plotted was very small (the 20 countries with the highest Human Development Index in the world in 2016), the relationship between HDI and life expectancy is hard to see.

We decided to make a correlation heat map to check the relationship between those two variables.

According to the heat map above, the correlation between HDI and life expectancy is 0.91, which is a very high positive linear correlation. To summarise, there is a strong relationship between HDI and life expectancy.

Question 4: What is the relationship between Gross Domestic Product (GDP) and the Human Development Index (HDI)?

Here you can see that Norway has the highest GDP and a higher HDI, whilst Denmark has a fairly average GDP and a lower HDI. So we can say that HDI is one of the variables that influences GDP.

It is easy to see that the level of HDI can affect GDP per capita and, the other way around, that economic growth can increase the Human Development Index.